# Setting Up SQLite Database

Instructions for creating, populating, and querying an SQLite database for arXiv images.


# Table of Contents

1.  [Setting Up SQLite Database](#orga37bb72)
    1.  [SQLite database](#org394ad8b)
        1.  [Create SQLite database](#org30737b4)
        2.  [Inserting article metadata into database](#orga7b5a53)
        3.  [Image metadata](#orgf4fd133)
        4.  [Additional image metadata](#org6f35520)
        5.  [Caption metadata](#org0fe5233)
        6.  [Top results for "creator" metadata](#org40b65be)
    2.  [Data examples](#org22d9ee9)
        1.  [Metadata](#org503a15f)
        2.  [Images](#org27f4773)
        3.  [Captions](#org02fac97)
    3.  [Cleaning and fixes](#org90c892b)
    4.  [Querying](#orgbbb2bed)
        1.  [Building list of all filepaths from SQLite database](#org9c39dd9)


<a id="org394ad8b"></a>

## SQLite database


<a id="org30737b4"></a>

### Create SQLite database

In order to index the images with their associated article metadata, we created a SQLite database with three tables: `metadata` with rows for each article; `images` with rows for each image file; and `captions` for figure captions, labels, and associated image ids. SQLite was chosen as a database format because of its widespread use and the simplicity of a lightweight, single-file database in contrast to client-server database formats. SQLite performs well within the expected number of entries (tables/columns/rows). SQLite is also a reasonably common format and has command line, R, and Python interfaces, and can also be read by other common data science tools such as the Python Data Analysis Library, a.k.a. pandas.

Python script to create SQLite database:

```bash
python create-db/create_sqlite_db.py
```

Details of table creation schema in the script:

```python
# create metadata table
c.execute('''
    CREATE TABLE metadata(id INTEGER PRIMARY KEY, identifier TEXT, created TEXT, \
    cat TEXT, authors TEXT, title TEXT, abstract TEXT, licence TEXT)
''')

# create images table
c.execute('''
    CREATE TABLE images (id INTEGER PRIMARY KEY, identifier TEXT, filename TEXT, \
    filesize INT, path TEXT, x INT, y INT, imageformat TEXT, creator TEXT)
''')

# create captions table
c.execute('''
    CREATE TABLE "captions" ("id" INTEGER, "identifier" TEXT, "tex" TEXT, \
    "fignum" TEXT, "caption" TEXT, "label" TEXT, "filenames" TEXT, "image_ids" TEXT, PRIMARY KEY("id"))
''')
```


<a id="orga7b5a53"></a>

### Inserting article metadata into database

We used a set of Python scripts to then iterate over the XML files and insert rows into the database for each. For this step we used the Python ElementTree XML API to navigate the required XML files and extract the desired metadata. This required custom code that was tailored to the specific XML format as it is stored on <https://arxiv.org> and downloaded by metha (which groups XML documents together to reduce the number of files).

For our purposes, we used a primary key of a unique number, and inserted the identifier, date created, categories, authors, title, abstract, and licence.

The `oai_to_sqlite.py` script accesses a folder of metha-downloaded OAI XML files.

```bash
usage: oai_to_sqlite.py [-h] [-v] db_path oai_path

Parse metha OAI XML files and insert metadata into SQLite database

positional arguments:
  db_path        path to SQLite database
  oai_path       set folder of OAI xml files

optional arguments:
  -h, --help     show this help message and exit
  -v, --verbose  verbose output
```

Example usage

```bash
python create-db/oai_to_sqlite.py ~/data/db/arxiv_db.sqlite ~/data/oai/metha/
```


<a id="orgf4fd133"></a>

### Image metadata

In order to add rows of image data to the database, we searched and iterated over all of the directory structure of the arXiv bulk downloads and gathered metadata directly from each file. We found that an effective way to do this was to use the unix `find` command to write all image file paths to a text file, then using those paths to run an identify command (from ImageMagick) to write a number of details to the database. We collected the article identifier, filename, filesize, filepath, x dimension size, y dimension size, and imageformat. This took a number of days to complete, but could be optimised using faster storage media and/or parallel processing.

This is done as a two step process because it will take a long time and it is helpful to be able to restart the process partway.

```bash
# first we need to get the paths of all the image files
arXiv-src-scripts/image_paths_to_txt.sh SOURCE_DIR TARGET_FILE
# this will take a little while

# then use this paths text file to get each image metadata and write into SQL
# this will also take a while
create-db/image_data_to_sql_paths.sh START_LINE PATHS_FILE DATABASE_FILE
```

Example usage

```bash
arXiv-src-scripts/image_paths_to_txt.sh ~/arXiv/src ~/data/paths/all_image_paths.txt
create-db/image_data_to_sql_paths.sh 0 ~/data/paths/all_image_paths.txt ~/data/db/arXiv_db.sqlite
```


<a id="org6f35520"></a>

### Additional image metadata

Additional metadata can be procured from the individual files by accessing the Exif (Exchangeable image file format). Although this could be done at the same time as inserting rows for the images into the database table, we did this at a later stage.

Example usage:

```bash
python create-db/imagemeta_to_sql_threads.py ~/data/db/arxiv_db_images.sqlite3 ~/arXiv/src_all/
```

This uses the `exiftool` command to check each file for metadata relating to "software" or "creator". Depending on the file extension, different fields are accessed. These were determined through testing `exiftool` across a range of file formats and checking which fields might relate to the software used to create or process the images. This is stored in a single column for simplicity, although the metadata may refer to different fields. See the following Python excerpt for the relation between extension and Exif field accessed:

```python
n = filename.lower()

if n.endswith(('.eps', '.ps', 'pstex', '.epsf', '.epsi')):
    field = "Creator"
elif n.endswith(('.png')):
    field = "Software"
elif n.endswith(('.pdf')):
    field = "Creator"
elif n.endswith(('.jpg', 'jpeg')):
    field = "Software"
elif n.endswith(('.gif')):
    field = "Comment"
elif n.endswith(('.svg')):
    field = "Desc"
```

Note that a fairly high proportion of images do not have this metadata, which will typically write a single hyphen ("-") to the database.


<a id="org0fe5233"></a>

### Caption metadata

The `captions` table was created by iterating through all `.tex` files and parsing the tex source for figures, captions, and filenames. Our method uses a number of different regular expressions to find particular key parts of the source. This code is specific to this dataset and unlikely to generalise.

We grabbed the caption for each figure (if there was one), writing the following fields to the database table: unique caption identifier, arXiv identifier, tex file path, figure number, caption text, figure label, filenames of any images used to create that figure (can be more than 1), and image ids of these images from the `images` table.

First we iterated through the entire folder structure and found any tex files, writing these into a JSON file.

```bash
usage: get_tex_list.py [-h] search_path output_file

Parse metha OAI XML files and insert metadata into SQLite database

positional arguments:
  search_path  set folder to search
  output_file  name of output file (JSON)

optional arguments:
  -h, --help   show this help message and exit
```

Example usage

```bash
python create-db/get_tex_list.py ~/arXiv/src_all/ all_texs.json
```

Then we pulled all figure captions from each of these `.tex` files, parsed the output using regular expressions, and wrote the results to the database.

```bash
usage: get_captions.py [-h] [--start_line START_LINE] [-v] [-t] [-z] [-r] [-n]
		       [-i] [--create_index]
		       db_path tex_list

Script for getting captions from .tex files

positional arguments:
  db_path               path to SQLite database
  tex_list              path to file that stores list of all .tex files

optional arguments:
  -h, --help            show this help message and exit
  --start_line START_LINE
			line to read textfile from (default: 0)
  -v, --verbose         verbose output
  -t, --timing          timing output
  -z, --dryrun          don't modify the database, just print (default: False)
  -r, --shuffle         shuffle the list of tex files (default: False)
  -n, --no_captions     do not write captions into db (default: False)
  -i, --get_images      get the image_ids for each filename (default: False)
  --create_index        create an index to speed up queries (default: False)
```

Example usage

```bash
python create-db/get-captions.py ~/data/db/arxiv_db_images.sqlite3 tex_paths_src_update.json 2>&1 | tee caption_update_log.txt
```

Finally, we went back over all the text files and grabbed the associated `image_ids` from the `images` table so that the figure captions can be cross-referenced with the images. This is also done using `get_captions.py` but with different arguments:

```bash
python get-captions.py ~/data/db/arxiv_db_images.sqlite3 tex_paths_src_update.json -i -n --create_index 2>&1 | tee caption_update_indexes_log.txt
```


<a id="org40b65be"></a>

### Top results for "creator" metadata

```org
|------------------+---------+-------|
| "creator"        |   total |     % |
|------------------+---------+-------|
| (none)           | 1997457 | 19.87 |
| MATLAB           |  876177 |  8.72 |
| Mathematica      |  492318 |  4.90 |
| matplotlib       |  491001 |  4.88 |
| IDL              |  404852 |  4.03 |
| gnuplot          |  396484 |  3.94 |
| cairo            |  388108 |  3.86 |
| fig2dev          |  349381 |  3.48 |
| SM               |  268902 |  2.67 |
| ROOT             |  265278 |  2.64 |
| Illustrator      |  263934 |  2.63 |
| Grace            |  237719 |  2.36 |
| dvips            |  232165 |  2.31 |
| TeX              |  209613 |  2.09 |
| GIMP             |  207108 |  2.06 |
| Ghostscript      |  199064 |  1.98 |
| OriginLab        |  168350 |  1.67 |
| HIGZ             |  144720 |  1.44 |
| R                |  143164 |  1.42 |
| PGPLOT           |  128704 |  1.28 |
| ImageMagick      |  123697 |  1.23 |
| CorelDRAW        |   91453 |  0.91 |
| jpeg2ps          |   87546 |  0.87 |
| PScript5         |   77136 |  0.77 |
| Photoshop        |   76648 |  0.76 |
| Acrobat          |   72191 |  0.72 |
| PowerPoint       |   50187 |  0.50 |
| XV               |   47320 |  0.47 |
| Ipe              |   43498 |  0.43 |
| Keynote          |   37964 |  0.38 |
| xmgr             |   37831 |  0.38 |
| PSCRIPT          |   36755 |  0.37 |
| inkscape         |   32036 |  0.32 |
| OmniGraffle      |   30788 |  0.31 |
| LaTeX            |   30473 |  0.30 |
| Preview          |   24770 |  0.25 |
| GraphicConverter |   24124 |  0.24 |
| FreeHEP          |   23621 |  0.23 |
| GTVIRT           |   20680 |  0.21 |
|------------------+---------+-------|
```


<a id="org22d9ee9"></a>

## Data examples


<a id="org503a15f"></a>

### Metadata

```org
|      id |    identifier |    created | cat             | authors                                                    | title                                                | abstract                                                                        | licence                                             |
| 1038521 | hep-ph0107222 | 2001-07-20 | hep-ph          | ['Yang, Jian-Jun; ']                                       | Up and Down Quark Contributions...                   | We check the...                                                                 |                                                     |
| 1235851 |     0912.5313 | 2009-12-29 | math.CV math.AG | ['Catanese, Fabrizio; Oguiso, Keiji; Peternell, Thomas; '] | On volume preserving complex structures on real tori | A basic problem in the classification theory of compact complex manifolds is... | http://arxiv.org/licenses/nonexclusive-distrib/1.0/ |
| 1214856 |     1308.0124 | 2013-08-01 | hep-ph hep-th   | ['Rose, Luigi Delle; ']                                    | The Standard Model in a Weak Gravitational...        | The principal goal of the physics of the fundamental interactions is...         | http://arxiv.org/licenses/nonexclusive-distrib/1.   |
```


<a id="org27f4773"></a>

### Images

```org
|      id | identifier | filename         | filesize | path                                                                                                      |    x |   y | imageformat | creator |
| 4876126 |  cs0007002 | gouala05.eps     |   145239 | ./0007/cs0007002                                                                                          |  663 | 300 | PS          |         |
| 2209549 |  0906.0725 | belleescan_b.eps |   842045 | ./0906/0906.0725                                                                                          | 1450 | 725 | PS          |         |
| 6591348 | 1710.10269 | HAT-P-12.pdf     |    78468 | ./1710/1710.10269/figures/figures_from_umserve/chemistry/abundance_change_with_grid_parameter/metallicity |  566 | 406 | PDF         |         |
```


<a id="org02fac97"></a>

### Captions

```org
|      id | identifier | tex                                                         | fignum | caption                                                                                                                                                   | label                         | filenames                       | image_ids |
| 7066264 | 1905.03030 | /arXiv/src_update/1905/1905.03030/metalearning.tex |      4 | Meta-learned state machine for a predictor of coin tosses. The figure shows...                                                                            | fig:metalearned-state-machine | figures/coin_prediction_memory  |  12007157 |
| 3294070 | 1611.05676 | /arXiv/src_all/1611/1611.05676/tryQFPT.tex         |      7 | $F_n$ versus $n$ for the rational  sampling time   $\gamma \tau /\pi=1/3$. Now the detection probability...                                               | fig1patha                     | inf_lattice_tau_03pi_comparison |   5413659 |
| 3902503 | 1810.07940 | /arXiv/src_all/1810/1810.07940/Astrum.tex          |      1 | The dependence of CR modulation for TRAPPIST 1d for different values of stellar magnetic field and stellar wind velocity 545~km/s ($T_{\text{cor}}=2$~MK) | fig1                          | Fig1                            |   6443166 |
```


<a id="org90c892b"></a>

## Cleaning and fixes

The database required some cleaning after inserting rows for both metadata and images. Please refer to the Jupyter Notebook `sqlite-scripts/db_fixes.ipynb`

This may be due to article revisions or the metadata harvester missing a few entries, as well as accumulating duplicates if the arXiv OAI data was updated on the server (metha does not sync previous files by default). A small number of images did not have corresponding article metadata. To fix this we used the list of articles without metadata and queried the arXiv OAI server. We then inserted these rows into the metadata table. This has been automated in the notebook.

We also searched the SQLite database for any entries with special characters such as '/' that caused errors in the file insertion. In each arXiv category pre-2007, we removed the forward slash as this could potentially cause problems in how this data is read (this can be added back in as required).

From the list of entries with special characters, many of these images were duplicated within the source upload. We checked that the images were present in the dataset and then manually removed the rows from the SQLite database as well as the files from the dataset. For later searches, we also filtered out any images that have an X or Y dimension of 0 or NULL, indicating that they could not be ready by the Imagemagick identify command.


<a id="orgbbb2bed"></a>

## Querying

Once both tables have been created, it is then possible to perform SQL queries with a left join to pair the associated metadata with a given image. This allows us to create queries and perform analyses that would not have been possible with only the bulk download: accessing the image data according to different metadata such as subject categories or date, performing searches, and analysing the image content of the dataset in various ways. The SQLite database provides a convenient and flexible way to perform these queries across the ~10 million images and ~1.5 million articles.

```sqlite
SELECT images.identifier, metadata.cat,count(images.identifier)    
FROM images
LEFT JOIN metadata ON images.identifier = metadata.identifier
GROUP BY images.identifier
ORDER BY count(images.identifier)
```

For more examples see the Jupyter Notebook `sqlite-scripts/db_queries.ipynb`


<a id="org9c39dd9"></a>

### Building list of all filepaths from SQLite database

Use the `sqlite-scripts/get_all_image_paths.py` script to read all entries from the SQLite database, build a list of file paths, and write this into a text file. Although we have already produced such a text file from the file structure, this updated version is useful as it allows us to use the SQLite database as a reference for all images from this point forward. This is for ease of use when running other scripts and especially for running scripts that might take a very long time or fail, so that it can be restarted at a known point.

```bash
usage: get_all_image_paths.py [-h] database textfile

Script for writing a text file with all paths from SQLite database

positional arguments:
  database    path to database
  textfile    textfile to write file paths to

optional arguments:
  -h, --help  show this help message and exit
```

Example usage

```bash
python sqlite-scripts/get_all_image_paths.py ~/data/db/arxiv_db.sqlite all_image_filepaths_sqlite.txt
```